September 22, 2016
Remove R Markdown (or add notebook)
Add some tidyr
Add one other exercise at the end
All source code at https://github.com/rladies/rtp-data-manipulation-09-22-16
Slides at [???]
R: Statistical programming language
Both are free and open-source
Install R: https://cran.r-project.org/
Install RStudio: https://www.rstudio.com/products/RStudio/#Desktop
R Splash Screen
The version of R is text that pops up in the Console when you start RStudio
To find out the version of RStudio go to Help \(\rightarrow\) About RStudio
It's good practice to keep both R and RStudio up to date
Packages are the fundamental units of reproducible R code. They include reusable R functions, the documentation that describes how to use them, and (often) sample data. (From: http://r-pkgs.had.co.nz)
We will use the ggplot2 package for plots and dplyr for data wrangling in this workshop
Install these packages by running the following in the Console:
install.packages("readr")
install.packages("tidyr")
install.packages("dplyr")
library(readr) library(tidyr) library(dplyr)
Do not just type code in the Console, it's error prone (especially typo prone!) and hard to keep track of
R Markdown: File -> New File -> R Markdown
Type your code in the R Script
Use curser + Run or highlight + Run
Shortcut for Run button: Command + Enter
Use # for comments
[ Demo ]
R Markdown is an authoring format that enables easy creation of dynamic documents, presentations, and reports from R.
It combines the core syntax of markdown (an easy-to-write plain text format) with embedded R code chunks that are run so their output can be included in the final document.
R Markdown documents are fully reproducible (they can be automatically regenerated whenever underlying R code or data changes).
Source: http://rmarkdown.rstudio.com/
[ Demo ]
Markdown is a simple formatting language designed to make authoring content easy for everyone.
Rather than writing complex markup code (e.g. HTML or LaTeX), Markdown enables the use of a syntax much more like plain-text email.
Markdown overview
Within an R Markdown file, R Code Chunks can be embedded using the native Markdown syntax for fenced code regions.
Comments can go in the chunks wih # or just as plain text outside the chunks
Code chunks
Open a new R Markdown documen and Knit it
How many code chunks are in your R Markdown document?
What does each code chunk do? You may not understand the R syntax yet, but you should be able to compare the source file and the output to answer this question.
R Markdown workspace and Console workspace are independent of each other
If you define a variable in your Console and it shows up in the Environment tab, it is not going to be automatically included in your R Markdown document
If you define a variable in your R Markdown document, it won't automatically be available in your Console
Tip: Use the Run all previous chunks in the source file and Run current chunk code functionality in the buttons in each code chunk to help manage workspaces.
bike <- read_csv2("https://stat.duke.edu/~mc301/data/nc_bike_crash.csv",
na = c("NA", "", "."))
## Parsed with column specification: ## cols( ## .default = col_character(), ## FID = col_integer(), ## OBJECTID = col_integer(), ## Bike_Age = col_integer(), ## Crash_Date = col_date(format = ""), ## Crash_Hour = col_integer(), ## Crash_Time = col_datetime(format = ""), ## Crash_Ty_1 = col_integer(), ## Crash_Year = col_integer(), ## Drvr_Age = col_integer(), ## Num_Units = col_integer() ## )
## See spec(...) for full column specifications.
View the names of variables via
names(bike)
## [1] "FID" "OBJECTID" "AmbulanceR" "BikeAge_Gr" "Bike_Age" ## [6] "Bike_Alc_D" "Bike_Dir" "Bike_Injur" "Bike_Pos" "Bike_Race" ## [11] "Bike_Sex" "City" "County" "CrashAlcoh" "CrashDay" ## [16] "Crash_Date" "Crash_Grp" "Crash_Hour" "Crash_Loc" "Crash_Mont" ## [21] "Crash_Time" "Crash_Type" "Crash_Ty_1" "Crash_Year" "Crsh_Sevri" ## [26] "Developmen" "DrvrAge_Gr" "Drvr_Age" "Drvr_Alc_D" "Drvr_EstSp" ## [31] "Drvr_Injur" "Drvr_Race" "Drvr_Sex" "Drvr_VehTy" "ExcsSpdInd" ## [36] "Hit_Run" "Light_Cond" "Locality" "Num_Lanes" "Num_Units" ## [41] "Rd_Charact" "Rd_Class" "Rd_Conditi" "Rd_Config" "Rd_Defects" ## [46] "Rd_Feature" "Rd_Surface" "Region" "Rural_Urba" "Speed_Limi" ## [51] "Traff_Cntr" "Weather" "Workzone_I" "Location"
and see detailed descriptions at https://stat.duke.edu/~mc301/data/nc_bike_crash.html.
You might be familiar with the read.csv() function for loading data from a csv file into R.
If using the read.csv() function R will convert character vectors into factors when they are included in a data frame.
Sometimes this is useful, sometimes it isn't -– either way it is important to know what type/class you are working with.
This behavior can be changed using the stringsAsFactors = FALSE when loading a data drame.
## Warning in instance$preRenderHook(instance): It seems your data is too ## big for client-side DataTables. You may consider server-side processing: ## http://rstudio.github.io/DT/server.html
str() function to compactly display the internal structure of an R objectstr(bike)
## Classes 'tbl_df', 'tbl' and 'data.frame': 5716 obs. of 54 variables: ## $ FID : int 18 29 33 35 49 53 56 60 63 66 ... ## $ OBJECTID : int 19 30 34 36 50 54 57 61 64 67 ... ## $ AmbulanceR: chr "No" "Yes" "No" "Yes" ... ## $ BikeAge_Gr: chr NA "50-59" NA "16-19" ... ## $ Bike_Age : int 6 51 10 17 6 52 18 40 6 7 ... ## $ Bike_Alc_D: chr "No" "No" "No" "No" ... ## $ Bike_Dir : chr "Not Applicable" "With Traffic" "With Traffic" NA ... ## $ Bike_Injur: chr "C: Possible Injury" "C: Possible Injury" "Injury" "B: Evident Injury" ... ## $ Bike_Pos : chr "Driveway / Alley" "Travel Lane" "Travel Lane" "Travel Lane" ... ## $ Bike_Race : chr "Black" "Black" "Black" "White" ... ## $ Bike_Sex : chr "Female" "Male" "Male" "Male" ... ## $ City : chr "Durham" "Greenville" "Farmville" "Charlotte" ... ## $ County : chr "Durham" "Pitt" "Pitt" "Mecklenburg" ... ## $ CrashAlcoh: chr "No" "No" "No" "No" ... ## $ CrashDay : chr "01-01-06" "01-01-02" "01-01-07" "01-01-05" ... ## $ Crash_Date: Date, format: "2007-01-06" "2007-01-09" ... ## $ Crash_Grp : chr "Bicyclist Failed to Yield - Midblock" "Crossing Paths - Other Circumstances" "Bicyclist Failed to Yield - Sign-Controlled Intersection" "Loss of Control / Turning Error" ... ## $ Crash_Hour: int 13 23 16 19 12 20 19 14 16 0 ... ## $ Crash_Loc : chr "Non-Intersection" "Intersection-Related" "Intersection" "Intersection" ... ## $ Crash_Mont: chr NA NA NA NA ... ## $ Crash_Time: POSIXct, format: "0001-01-01 13:17:58" "0001-01-01 23:08:58" ... ## $ Crash_Type: chr "Bicyclist Ride Out - Residential Driveway" "Crossing Paths - Intersection - Other /" "Bicyclist Ride Through - Sign-Controlled Intersection" "Motorist Lost Control - Other /" ... ## $ Crash_Ty_1: int 353311 211180 111144 119139 112114 311231 119144 132180 112142 460910 ... ## $ Crash_Year: int 2007 2007 2007 2007 2007 2007 2007 2007 2007 2007 ... ## $ Crsh_Sevri: chr "C: Possible Injury" "C: Possible Injury" "O: No Injury" "B: Evident Injury" ... ## $ Developmen: chr "Residential" "Commercial" "Residential" "Residential" ... ## $ DrvrAge_Gr: chr "60-69" "30-39" "50-59" "30-39" ... ## $ Drvr_Age : int 66 34 52 33 NA 20 40 NA 17 51 ... ## $ Drvr_Alc_D: chr "No" "No" "No" "No" ... ## $ Drvr_EstSp: chr "11-15 mph" "0-5 mph" "21-25 mph" "46-50 mph" ... ## $ Drvr_Injur: chr "O: No Injury" "O: No Injury" "O: No Injury" "O: No Injury" ... ## $ Drvr_Race : chr "Black" "Black" "White" "White" ... ## $ Drvr_Sex : chr "Male" "Male" "Female" "Female" ... ## $ Drvr_VehTy: chr "Pickup" "Passenger Car" "Passenger Car" "Sport Utility" ... ## $ ExcsSpdInd: chr "No" "No" "No" "No" ... ## $ Hit_Run : chr "No" "No" "No" "No" ... ## $ Light_Cond: chr "Daylight" "Dark - Lighted Roadway" "Daylight" "Dark - Roadway Not Lighted" ... ## $ Locality : chr "Mixed (30% To 70% Developed)" "Urban (>70% Developed)" "Mixed (30% To 70% Developed)" "Urban (>70% Developed)" ... ## $ Num_Lanes : chr "2 lanes" "5 lanes" "2 lanes" "4 lanes" ... ## $ Num_Units : int 2 2 2 3 2 2 2 2 2 2 ... ## $ Rd_Charact: chr "Straight - Level" "Straight - Level" "Straight - Level" "Straight - Level" ... ## $ Rd_Class : chr "Local Street" "Local Street" "Local Street" "NC Route" ... ## $ Rd_Conditi: chr "Dry" "Dry" "Dry" "Dry" ... ## $ Rd_Config : chr "Two-Way, Not Divided" "Two-Way, Divided, Unprotected Median" "Two-Way, Not Divided" "Two-Way, Divided, Unprotected Median" ... ## $ Rd_Defects: chr "None" "None" "None" "None" ... ## $ Rd_Feature: chr "No Special Feature" "Four-Way Intersection" "Four-Way Intersection" "Four-Way Intersection" ... ## $ Rd_Surface: chr "Smooth Asphalt" "Smooth Asphalt" "Smooth Asphalt" "Smooth Asphalt" ... ## $ Region : chr "Piedmont" "Coastal" "Coastal" "Piedmont" ... ## $ Rural_Urba: chr "Urban" "Urban" "Rural" "Urban" ... ## $ Speed_Limi: chr "20 - 25 MPH" "40 - 45 MPH" "30 - 35 MPH" "40 - 45 MPH" ... ## $ Traff_Cntr: chr "No Control Present" "Stop And Go Signal" "Stop Sign" "Stop And Go Signal" ... ## $ Weather : chr "Clear" "Clear" "Clear" "Cloudy" ... ## $ Workzone_I: chr "No" "No" "No" "No" ... ## $ Location : chr "36.002743, -78.8785" "35.612984, -77.39265" "35.595676, -77.59074" "35.076767, -80.7728" ... ## - attr(*, "spec")=List of 2 ## ..$ cols :List of 54 ## .. ..$ FID : list() ## .. .. ..- attr(*, "class")= chr "collector_integer" "collector" ## .. ..$ OBJECTID : list() ## .. .. ..- attr(*, "class")= chr "collector_integer" "collector" ## .. ..$ AmbulanceR: list() ## .. .. ..- attr(*, "class")= chr "collector_character" "collector" ## .. ..$ BikeAge_Gr: list() ## .. .. ..- attr(*, "class")= chr "collector_character" "collector" ## .. ..$ Bike_Age : list() ## .. .. ..- attr(*, "class")= chr "collector_integer" "collector" ## .. ..$ Bike_Alc_D: list() ## .. .. ..- attr(*, "class")= chr "collector_character" "collector" ## .. ..$ Bike_Dir : list() ## .. .. ..- attr(*, "class")= chr "collector_character" "collector" ## .. ..$ Bike_Injur: list() ## .. .. ..- attr(*, "class")= chr "collector_character" "collector" ## .. ..$ Bike_Pos : list() ## .. .. ..- attr(*, "class")= chr "collector_character" "collector" ## .. ..$ Bike_Race : list() ## .. .. ..- attr(*, "class")= chr "collector_character" "collector" ## .. ..$ Bike_Sex : list() ## .. .. ..- attr(*, "class")= chr "collector_character" "collector" ## .. ..$ City : list() ## .. .. ..- attr(*, "class")= chr "collector_character" "collector" ## .. ..$ County : list() ## .. .. ..- attr(*, "class")= chr "collector_character" "collector" ## .. ..$ CrashAlcoh: list() ## .. .. ..- attr(*, "class")= chr "collector_character" "collector" ## .. ..$ CrashDay : list() ## .. .. ..- attr(*, "class")= chr "collector_character" "collector" ## .. ..$ Crash_Date:List of 1 ## .. .. ..$ format: chr "" ## .. .. ..- attr(*, "class")= chr "collector_date" "collector" ## .. ..$ Crash_Grp : list() ## .. .. ..- attr(*, "class")= chr "collector_character" "collector" ## .. ..$ Crash_Hour: list() ## .. .. ..- attr(*, "class")= chr "collector_integer" "collector" ## .. ..$ Crash_Loc : list() ## .. .. ..- attr(*, "class")= chr "collector_character" "collector" ## .. ..$ Crash_Mont: list() ## .. .. ..- attr(*, "class")= chr "collector_character" "collector" ## .. ..$ Crash_Time:List of 1 ## .. .. ..$ format: chr "" ## .. .. ..- attr(*, "class")= chr "collector_datetime" "collector" ## .. ..$ Crash_Type: list() ## .. .. ..- attr(*, "class")= chr "collector_character" "collector" ## .. ..$ Crash_Ty_1: list() ## .. .. ..- attr(*, "class")= chr "collector_integer" "collector" ## .. ..$ Crash_Year: list() ## .. .. ..- attr(*, "class")= chr "collector_integer" "collector" ## .. ..$ Crsh_Sevri: list() ## .. .. ..- attr(*, "class")= chr "collector_character" "collector" ## .. ..$ Developmen: list() ## .. .. ..- attr(*, "class")= chr "collector_character" "collector" ## .. ..$ DrvrAge_Gr: list() ## .. .. ..- attr(*, "class")= chr "collector_character" "collector" ## .. ..$ Drvr_Age : list() ## .. .. ..- attr(*, "class")= chr "collector_integer" "collector" ## .. ..$ Drvr_Alc_D: list() ## .. .. ..- attr(*, "class")= chr "collector_character" "collector" ## .. ..$ Drvr_EstSp: list() ## .. .. ..- attr(*, "class")= chr "collector_character" "collector" ## .. ..$ Drvr_Injur: list() ## .. .. ..- attr(*, "class")= chr "collector_character" "collector" ## .. ..$ Drvr_Race : list() ## .. .. ..- attr(*, "class")= chr "collector_character" "collector" ## .. ..$ Drvr_Sex : list() ## .. .. ..- attr(*, "class")= chr "collector_character" "collector" ## .. ..$ Drvr_VehTy: list() ## .. .. ..- attr(*, "class")= chr "collector_character" "collector" ## .. ..$ ExcsSpdInd: list() ## .. .. ..- attr(*, "class")= chr "collector_character" "collector" ## .. ..$ Hit_Run : list() ## .. .. ..- attr(*, "class")= chr "collector_character" "collector" ## .. ..$ Light_Cond: list() ## .. .. ..- attr(*, "class")= chr "collector_character" "collector" ## .. ..$ Locality : list() ## .. .. ..- attr(*, "class")= chr "collector_character" "collector" ## .. ..$ Num_Lanes : list() ## .. .. ..- attr(*, "class")= chr "collector_character" "collector" ## .. ..$ Num_Units : list() ## .. .. ..- attr(*, "class")= chr "collector_integer" "collector" ## .. ..$ Rd_Charact: list() ## .. .. ..- attr(*, "class")= chr "collector_character" "collector" ## .. ..$ Rd_Class : list() ## .. .. ..- attr(*, "class")= chr "collector_character" "collector" ## .. ..$ Rd_Conditi: list() ## .. .. ..- attr(*, "class")= chr "collector_character" "collector" ## .. ..$ Rd_Config : list() ## .. .. ..- attr(*, "class")= chr "collector_character" "collector" ## .. ..$ Rd_Defects: list() ## .. .. ..- attr(*, "class")= chr "collector_character" "collector" ## .. ..$ Rd_Feature: list() ## .. .. ..- attr(*, "class")= chr "collector_character" "collector" ## .. ..$ Rd_Surface: list() ## .. .. ..- attr(*, "class")= chr "collector_character" "collector" ## .. ..$ Region : list() ## .. .. ..- attr(*, "class")= chr "collector_character" "collector" ## .. ..$ Rural_Urba: list() ## .. .. ..- attr(*, "class")= chr "collector_character" "collector" ## .. ..$ Speed_Limi: list() ## .. .. ..- attr(*, "class")= chr "collector_character" "collector" ## .. ..$ Traff_Cntr: list() ## .. .. ..- attr(*, "class")= chr "collector_character" "collector" ## .. ..$ Weather : list() ## .. .. ..- attr(*, "class")= chr "collector_character" "collector" ## .. ..$ Workzone_I: list() ## .. .. ..- attr(*, "class")= chr "collector_character" "collector" ## .. ..$ Location : list() ## .. .. ..- attr(*, "class")= chr "collector_character" "collector" ## ..$ default: list() ## .. ..- attr(*, "class")= chr "collector_guess" "collector" ## ..- attr(*, "class")= chr "col_spec"
glimpse(bike)
## Observations: 5,716 ## Variables: 54 ## $ FID <int> 18, 29, 33, 35, 49, 53, 56, 60, 63, 66, 72, 75, 82,... ## $ OBJECTID <int> 19, 30, 34, 36, 50, 54, 57, 61, 64, 67, 73, 76, 83,... ## $ AmbulanceR <chr> "No", "Yes", "No", "Yes", "No", "Yes", "Yes", "No",... ## $ BikeAge_Gr <chr> NA, "50-59", NA, "16-19", NA, "50-59", "16-19", "40... ## $ Bike_Age <int> 6, 51, 10, 17, 6, 52, 18, 40, 6, 7, 45, 30, 17, 20,... ## $ Bike_Alc_D <chr> "No", "No", "No", "No", "No", "No", "No", "No", "No... ## $ Bike_Dir <chr> "Not Applicable", "With Traffic", "With Traffic", N... ## $ Bike_Injur <chr> "C: Possible Injury", "C: Possible Injury", "Injury... ## $ Bike_Pos <chr> "Driveway / Alley", "Travel Lane", "Travel Lane", "... ## $ Bike_Race <chr> "Black", "Black", "Black", "White", "Black", "White... ## $ Bike_Sex <chr> "Female", "Male", "Male", "Male", "Male", "Male", "... ## $ City <chr> "Durham", "Greenville", "Farmville", "Charlotte", "... ## $ County <chr> "Durham", "Pitt", "Pitt", "Mecklenburg", "Mecklenbu... ## $ CrashAlcoh <chr> "No", "No", "No", "No", "No", "No", "No", "No", "No... ## $ CrashDay <chr> "01-01-06", "01-01-02", "01-01-07", "01-01-05", NA,... ## $ Crash_Date <date> 2007-01-06, 2007-01-09, 2007-01-14, 2007-01-12, 20... ## $ Crash_Grp <chr> "Bicyclist Failed to Yield - Midblock", "Crossing P... ## $ Crash_Hour <int> 13, 23, 16, 19, 12, 20, 19, 14, 16, 0, 17, 18, 14, ... ## $ Crash_Loc <chr> "Non-Intersection", "Intersection-Related", "Inters... ## $ Crash_Mont <chr> NA, NA, NA, NA, NA, "01-04-01", "01-04-01", NA, "01... ## $ Crash_Time <dttm> 0001-01-01 13:17:58, 0001-01-01 23:08:58, 0001-01-... ## $ Crash_Type <chr> "Bicyclist Ride Out - Residential Driveway", "Cross... ## $ Crash_Ty_1 <int> 353311, 211180, 111144, 119139, 112114, 311231, 119... ## $ Crash_Year <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 200... ## $ Crsh_Sevri <chr> "C: Possible Injury", "C: Possible Injury", "O: No ... ## $ Developmen <chr> "Residential", "Commercial", "Residential", "Reside... ## $ DrvrAge_Gr <chr> "60-69", "30-39", "50-59", "30-39", NA, "20-24", "4... ## $ Drvr_Age <int> 66, 34, 52, 33, NA, 20, 40, NA, 17, 51, NA, 64, 50,... ## $ Drvr_Alc_D <chr> "No", "No", "No", "No", "Missing", "No", "No", "Mis... ## $ Drvr_EstSp <chr> "11-15 mph", "0-5 mph", "21-25 mph", "46-50 mph", "... ## $ Drvr_Injur <chr> "O: No Injury", "O: No Injury", "O: No Injury", "O:... ## $ Drvr_Race <chr> "Black", "Black", "White", "White", "/Missing", "Wh... ## $ Drvr_Sex <chr> "Male", "Male", "Female", "Female", NA, "Female", "... ## $ Drvr_VehTy <chr> "Pickup", "Passenger Car", "Passenger Car", "Sport ... ## $ ExcsSpdInd <chr> "No", "No", "No", "No", "No", "No", "No", "No", "No... ## $ Hit_Run <chr> "No", "No", "No", "No", "Yes", "No", "No", "Yes", "... ## $ Light_Cond <chr> "Daylight", "Dark - Lighted Roadway", "Daylight", "... ## $ Locality <chr> "Mixed (30% To 70% Developed)", "Urban (>70% Develo... ## $ Num_Lanes <chr> "2 lanes", "5 lanes", "2 lanes", "4 lanes", "2 lane... ## $ Num_Units <int> 2, 2, 2, 3, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, ... ## $ Rd_Charact <chr> "Straight - Level", "Straight - Level", "Straight -... ## $ Rd_Class <chr> "Local Street", "Local Street", "Local Street", "NC... ## $ Rd_Conditi <chr> "Dry", "Dry", "Dry", "Dry", "Dry", "Dry", "Dry", "D... ## $ Rd_Config <chr> "Two-Way, Not Divided", "Two-Way, Divided, Unprotec... ## $ Rd_Defects <chr> "None", "None", "None", "None", "None", "None", "No... ## $ Rd_Feature <chr> "No Special Feature", "Four-Way Intersection", "Fou... ## $ Rd_Surface <chr> "Smooth Asphalt", "Smooth Asphalt", "Smooth Asphalt... ## $ Region <chr> "Piedmont", "Coastal", "Coastal", "Piedmont", "Pied... ## $ Rural_Urba <chr> "Urban", "Urban", "Rural", "Urban", "Urban", "Urban... ## $ Speed_Limi <chr> "20 - 25 MPH", "40 - 45 MPH", "30 - 35 MPH", "40... ## $ Traff_Cntr <chr> "No Control Present", "Stop And Go Signal", "Stop S... ## $ Weather <chr> "Clear", "Clear", "Clear", "Cloudy", "Clear", "Clea... ## $ Workzone_I <chr> "No", "No", "No", "No", "No", "No", "No", "No", "No... ## $ Location <chr> "36.002743, -78.8785", "35.612984, -77.39265", "35....
Using base R functions
Using the tidyr and dplyr packages \(\leftarrow\) our focus today
Using a variety of other packages from the Tidyverse like plyr, lubridate, etc.
dplyrThe dplyr package is based on the concepts of functions as verbs that manipulate data frames:
filter(): pick rows matching criteriaselect(): pick columns by namerename(): rename specific columnsarrange(): reorder rowsmutate(): add new variablestransmute(): create new data frame with variablessample_n() / sample_frac(): randomly sample rowssummarise(): reduce variables to valuesdplyr rulesfilter()filter()for crashes in Durham County
bike %>% filter(County == "Durham")
## # A tibble: 253 × 54 ## FID OBJECTID AmbulanceR BikeAge_Gr Bike_Age Bike_Alc_D Bike_Dir ## <int> <int> <chr> <chr> <int> <chr> <chr> ## 1 18 19 No <NA> 6 No Not Applicable ## 2 53 54 Yes 50-59 52 No With Traffic ## 3 56 57 Yes 16-19 18 No <NA> ## 4 209 210 No 16-19 16 No Facing Traffic ## 5 228 229 Yes 40-49 40 No With Traffic ## 6 620 621 Yes 50-59 55 No With Traffic ## 7 667 668 Yes 60-69 61 No Not Applicable ## 8 458 459 Yes 60-69 62 No With Traffic ## 9 576 577 No 40-49 49 No With Traffic ## 10 618 619 No 20-24 23 No With Traffic ## # ... with 243 more rows, and 47 more variables: Bike_Injur <chr>, ## # Bike_Pos <chr>, Bike_Race <chr>, Bike_Sex <chr>, City <chr>, ## # County <chr>, CrashAlcoh <chr>, CrashDay <chr>, Crash_Date <date>, ## # Crash_Grp <chr>, Crash_Hour <int>, Crash_Loc <chr>, Crash_Mont <chr>, ## # Crash_Time <dttm>, Crash_Type <chr>, Crash_Ty_1 <int>, ## # Crash_Year <int>, Crsh_Sevri <chr>, Developmen <chr>, ## # DrvrAge_Gr <chr>, Drvr_Age <int>, Drvr_Alc_D <chr>, Drvr_EstSp <chr>, ## # Drvr_Injur <chr>, Drvr_Race <chr>, Drvr_Sex <chr>, Drvr_VehTy <chr>, ## # ExcsSpdInd <chr>, Hit_Run <chr>, Light_Cond <chr>, Locality <chr>, ## # Num_Lanes <chr>, Num_Units <int>, Rd_Charact <chr>, Rd_Class <chr>, ## # Rd_Conditi <chr>, Rd_Config <chr>, Rd_Defects <chr>, Rd_Feature <chr>, ## # Rd_Surface <chr>, Region <chr>, Rural_Urba <chr>, Speed_Limi <chr>, ## # Traff_Cntr <chr>, Weather <chr>, Workzone_I <chr>, Location <chr>
filter()for crashes in Durham County where biker was < 10 yrs old
bike %>% filter(County == "Durham", Bike_Age < 10)
## # A tibble: 20 × 54 ## FID OBJECTID AmbulanceR BikeAge_Gr Bike_Age Bike_Alc_D Bike_Dir ## <int> <int> <chr> <chr> <int> <chr> <chr> ## 1 18 19 No <NA> 6 No Not Applicable ## 2 47 48 No 10-Jun 9 No Not Applicable ## 3 124 125 Yes 10-Jun 8 No With Traffic ## 4 531 532 Yes 10-Jun 7 No With Traffic ## 5 704 705 Yes 10-Jun 9 No Not Applicable ## 6 42 43 No 10-Jun 8 No With Traffic ## 7 392 393 Yes 0-5 2 No Not Applicable ## 8 941 942 No 10-Jun 9 No With Traffic ## 9 436 437 Yes 10-Jun 6 No Not Applicable ## 10 160 161 Yes 10-Jun 7 No With Traffic ## 11 273 274 Yes 10-Jun 7 No Facing Traffic ## 12 78 79 Yes 10-Jun 7 No With Traffic ## 13 422 423 No 10-Jun 9 No Not Applicable ## 14 570 571 No <NA> 0 Missing Not Applicable ## 15 683 684 Yes 10-Jun 8 No Not Applicable ## 16 62 63 Yes 10-Jun 7 No With Traffic ## 17 248 249 No 0-5 4 No Not Applicable ## 18 306 307 Yes 10-Jun 8 No With Traffic ## 19 231 232 Yes 10-Jun 8 No With Traffic ## 20 361 362 Yes 10-Jun 9 No With Traffic ## # ... with 47 more variables: Bike_Injur <chr>, Bike_Pos <chr>, ## # Bike_Race <chr>, Bike_Sex <chr>, City <chr>, County <chr>, ## # CrashAlcoh <chr>, CrashDay <chr>, Crash_Date <date>, Crash_Grp <chr>, ## # Crash_Hour <int>, Crash_Loc <chr>, Crash_Mont <chr>, ## # Crash_Time <dttm>, Crash_Type <chr>, Crash_Ty_1 <int>, ## # Crash_Year <int>, Crsh_Sevri <chr>, Developmen <chr>, ## # DrvrAge_Gr <chr>, Drvr_Age <int>, Drvr_Alc_D <chr>, Drvr_EstSp <chr>, ## # Drvr_Injur <chr>, Drvr_Race <chr>, Drvr_Sex <chr>, Drvr_VehTy <chr>, ## # ExcsSpdInd <chr>, Hit_Run <chr>, Light_Cond <chr>, Locality <chr>, ## # Num_Lanes <chr>, Num_Units <int>, Rd_Charact <chr>, Rd_Class <chr>, ## # Rd_Conditi <chr>, Rd_Config <chr>, Rd_Defects <chr>, Rd_Feature <chr>, ## # Rd_Surface <chr>, Region <chr>, Rural_Urba <chr>, Speed_Limi <chr>, ## # Traff_Cntr <chr>, Weather <chr>, Workzone_I <chr>, Location <chr>
| operator | definition |
|---|---|
< |
less than |
<= |
less than or equal to |
> |
greater than |
>= |
greater than or equal to |
== |
exactly equal to |
!= |
not equal to |
x | y |
x OR y |
x & y |
x AND y |
| operator | definition |
|---|---|
is.na(x) |
test if x is NA |
!is.na(x) |
test if x is not NA |
x %in% y |
test if x is in y |
!(x %in% y) |
test if x is not in y |
!x |
not x |
What in the world does a BikeAge_gr of 10-Jun or 15-Nov mean?
bike %>% group_by(BikeAge_Gr) %>% summarise(crash_count = n())
## # A tibble: 13 × 2 ## BikeAge_Gr crash_count ## <chr> <int> ## 1 0-5 60 ## 2 10-Jun 421 ## 3 15-Nov 747 ## 4 16-19 605 ## 5 20-24 680 ## 6 25-29 430 ## 7 30-39 658 ## 8 40-49 920 ## 9 50-59 739 ## 10 60-69 274 ## 11 70 12 ## 12 70+ 58 ## 13 <NA> 112
10-Jun should be 6-1015-Nov should be 11-15stringrstringrinstall.packages("stringr")
library(stringr)
str_replace() and add new variables with mutate()BikeAge_Gr variable: 10-Jun should be 6-10 and 15-Nov should be 11-15bike <- bike %>% mutate(BikeAge_Gr = str_replace(BikeAge_Gr, "10-Jun", "6-10")) %>% mutate(BikeAge_Gr = str_replace(BikeAge_Gr, "15-Nov", "11-15"))
Always check your changes and confirm code did what you wanted it to do
bike %>% group_by(BikeAge_Gr) %>% summarise(count = n())
## # A tibble: 13 × 2 ## BikeAge_Gr count ## <chr> <int> ## 1 0-5 60 ## 2 11-15 747 ## 3 16-19 605 ## 4 20-24 680 ## 5 25-29 430 ## 6 30-39 658 ## 7 40-49 920 ## 8 50-59 739 ## 9 6-10 421 ## 10 60-69 274 ## 11 70 12 ## 12 70+ 58 ## 13 <NA> 112
slice() for certain row numbersFirst five
bike %>% slice(1:5)
## # A tibble: 5 × 54 ## FID OBJECTID AmbulanceR BikeAge_Gr Bike_Age Bike_Alc_D Bike_Dir ## <int> <int> <chr> <chr> <int> <chr> <chr> ## 1 18 19 No <NA> 6 No Not Applicable ## 2 29 30 Yes 50-59 51 No With Traffic ## 3 33 34 No <NA> 10 No With Traffic ## 4 35 36 Yes 16-19 17 No <NA> ## 5 49 50 No <NA> 6 No Facing Traffic ## # ... with 47 more variables: Bike_Injur <chr>, Bike_Pos <chr>, ## # Bike_Race <chr>, Bike_Sex <chr>, City <chr>, County <chr>, ## # CrashAlcoh <chr>, CrashDay <chr>, Crash_Date <date>, Crash_Grp <chr>, ## # Crash_Hour <int>, Crash_Loc <chr>, Crash_Mont <chr>, ## # Crash_Time <dttm>, Crash_Type <chr>, Crash_Ty_1 <int>, ## # Crash_Year <int>, Crsh_Sevri <chr>, Developmen <chr>, ## # DrvrAge_Gr <chr>, Drvr_Age <int>, Drvr_Alc_D <chr>, Drvr_EstSp <chr>, ## # Drvr_Injur <chr>, Drvr_Race <chr>, Drvr_Sex <chr>, Drvr_VehTy <chr>, ## # ExcsSpdInd <chr>, Hit_Run <chr>, Light_Cond <chr>, Locality <chr>, ## # Num_Lanes <chr>, Num_Units <int>, Rd_Charact <chr>, Rd_Class <chr>, ## # Rd_Conditi <chr>, Rd_Config <chr>, Rd_Defects <chr>, Rd_Feature <chr>, ## # Rd_Surface <chr>, Region <chr>, Rural_Urba <chr>, Speed_Limi <chr>, ## # Traff_Cntr <chr>, Weather <chr>, Workzone_I <chr>, Location <chr>
slice() for certain row numbersLast five
last_row <- nrow(bike) bike %>% slice((last_row-4):last_row)
## # A tibble: 5 × 54 ## FID OBJECTID AmbulanceR BikeAge_Gr Bike_Age Bike_Alc_D Bike_Dir ## <int> <int> <chr> <chr> <int> <chr> <chr> ## 1 460 461 Yes 6-10 7 No Not Applicable ## 2 474 475 Yes 50-59 50 No With Traffic ## 3 479 480 Yes 16-19 16 No Not Applicable ## 4 487 488 No 40-49 47 Yes With Traffic ## 5 488 489 Yes 30-39 35 No Facing Traffic ## # ... with 47 more variables: Bike_Injur <chr>, Bike_Pos <chr>, ## # Bike_Race <chr>, Bike_Sex <chr>, City <chr>, County <chr>, ## # CrashAlcoh <chr>, CrashDay <chr>, Crash_Date <date>, Crash_Grp <chr>, ## # Crash_Hour <int>, Crash_Loc <chr>, Crash_Mont <chr>, ## # Crash_Time <dttm>, Crash_Type <chr>, Crash_Ty_1 <int>, ## # Crash_Year <int>, Crsh_Sevri <chr>, Developmen <chr>, ## # DrvrAge_Gr <chr>, Drvr_Age <int>, Drvr_Alc_D <chr>, Drvr_EstSp <chr>, ## # Drvr_Injur <chr>, Drvr_Race <chr>, Drvr_Sex <chr>, Drvr_VehTy <chr>, ## # ExcsSpdInd <chr>, Hit_Run <chr>, Light_Cond <chr>, Locality <chr>, ## # Num_Lanes <chr>, Num_Units <int>, Rd_Charact <chr>, Rd_Class <chr>, ## # Rd_Conditi <chr>, Rd_Config <chr>, Rd_Defects <chr>, Rd_Feature <chr>, ## # Rd_Surface <chr>, Region <chr>, Rural_Urba <chr>, Speed_Limi <chr>, ## # Traff_Cntr <chr>, Weather <chr>, Workzone_I <chr>, Location <chr>
select() to keep only the variables you mentionbike %>% select(Crash_Loc, Hit_Run) %>% table()
## Hit_Run ## Crash_Loc No Yes ## Intersection 2223 275 ## Intersection-Related 252 42 ## Location 3 7 ## Non-Intersection 2213 462 ## Non-Roadway 205 30
select()to exclude variablesbike %>% select(-OBJECTID)
## # A tibble: 5,716 × 53 ## FID AmbulanceR BikeAge_Gr Bike_Age Bike_Alc_D Bike_Dir ## <int> <chr> <chr> <int> <chr> <chr> ## 1 18 No <NA> 6 No Not Applicable ## 2 29 Yes 50-59 51 No With Traffic ## 3 33 No <NA> 10 No With Traffic ## 4 35 Yes 16-19 17 No <NA> ## 5 49 No <NA> 6 No Facing Traffic ## 6 53 Yes 50-59 52 No With Traffic ## 7 56 Yes 16-19 18 No <NA> ## 8 60 No 40-49 40 No Facing Traffic ## 9 63 Yes 6-10 6 No Facing Traffic ## 10 66 Yes 6-10 7 No <NA> ## # ... with 5,706 more rows, and 47 more variables: Bike_Injur <chr>, ## # Bike_Pos <chr>, Bike_Race <chr>, Bike_Sex <chr>, City <chr>, ## # County <chr>, CrashAlcoh <chr>, CrashDay <chr>, Crash_Date <date>, ## # Crash_Grp <chr>, Crash_Hour <int>, Crash_Loc <chr>, Crash_Mont <chr>, ## # Crash_Time <dttm>, Crash_Type <chr>, Crash_Ty_1 <int>, ## # Crash_Year <int>, Crsh_Sevri <chr>, Developmen <chr>, ## # DrvrAge_Gr <chr>, Drvr_Age <int>, Drvr_Alc_D <chr>, Drvr_EstSp <chr>, ## # Drvr_Injur <chr>, Drvr_Race <chr>, Drvr_Sex <chr>, Drvr_VehTy <chr>, ## # ExcsSpdInd <chr>, Hit_Run <chr>, Light_Cond <chr>, Locality <chr>, ## # Num_Lanes <chr>, Num_Units <int>, Rd_Charact <chr>, Rd_Class <chr>, ## # Rd_Conditi <chr>, Rd_Config <chr>, Rd_Defects <chr>, Rd_Feature <chr>, ## # Rd_Surface <chr>, Region <chr>, Rural_Urba <chr>, Speed_Limi <chr>, ## # Traff_Cntr <chr>, Weather <chr>, Workzone_I <chr>, Location <chr>
rename() specific columnsCorrect typos and rename to make variable names shorter and/or more informative
names(bike)
## [1] "FID" "OBJECTID" "AmbulanceR" "BikeAge_Gr" "Bike_Age" ## [6] "Bike_Alc_D" "Bike_Dir" "Bike_Injur" "Bike_Pos" "Bike_Race" ## [11] "Bike_Sex" "City" "County" "CrashAlcoh" "CrashDay" ## [16] "Crash_Date" "Crash_Grp" "Crash_Hour" "Crash_Loc" "Crash_Mont" ## [21] "Crash_Time" "Crash_Type" "Crash_Ty_1" "Crash_Year" "Crsh_Sevri" ## [26] "Developmen" "DrvrAge_Gr" "Drvr_Age" "Drvr_Alc_D" "Drvr_EstSp" ## [31] "Drvr_Injur" "Drvr_Race" "Drvr_Sex" "Drvr_VehTy" "ExcsSpdInd" ## [36] "Hit_Run" "Light_Cond" "Locality" "Num_Lanes" "Num_Units" ## [41] "Rd_Charact" "Rd_Class" "Rd_Conditi" "Rd_Config" "Rd_Defects" ## [46] "Rd_Feature" "Rd_Surface" "Region" "Rural_Urba" "Speed_Limi" ## [51] "Traff_Cntr" "Weather" "Workzone_I" "Location"
Speed_Limi to Speed_Limit:bike <- bike %>% rename(Speed_Limit = Speed_Limi)
Always check your changes and confirm code did what you wanted it to do
names(bike)
## [1] "FID" "OBJECTID" "AmbulanceR" "BikeAge_Gr" "Bike_Age" ## [6] "Bike_Alc_D" "Bike_Dir" "Bike_Injur" "Bike_Pos" "Bike_Race" ## [11] "Bike_Sex" "City" "County" "CrashAlcoh" "CrashDay" ## [16] "Crash_Date" "Crash_Grp" "Crash_Hour" "Crash_Loc" "Crash_Mont" ## [21] "Crash_Time" "Crash_Type" "Crash_Ty_1" "Crash_Year" "Crsh_Sevri" ## [26] "Developmen" "DrvrAge_Gr" "Drvr_Age" "Drvr_Alc_D" "Drvr_EstSp" ## [31] "Drvr_Injur" "Drvr_Race" "Drvr_Sex" "Drvr_VehTy" "ExcsSpdInd" ## [36] "Hit_Run" "Light_Cond" "Locality" "Num_Lanes" "Num_Units" ## [41] "Rd_Charact" "Rd_Class" "Rd_Conditi" "Rd_Config" "Rd_Defects" ## [46] "Rd_Feature" "Rd_Surface" "Region" "Rural_Urba" "Speed_Limit" ## [51] "Traff_Cntr" "Weather" "Workzone_I" "Location"
summarise() in a new data framebike %>% group_by(BikeAge_Gr) %>% summarise(crash_count = n()) %>% arrange(crash_count)
## # A tibble: 13 × 2 ## BikeAge_Gr crash_count ## <chr> <int> ## 1 70 12 ## 2 70+ 58 ## 3 0-5 60 ## 4 <NA> 112 ## 5 60-69 274 ## 6 6-10 421 ## 7 25-29 430 ## 8 16-19 605 ## 9 30-39 658 ## 10 20-24 680 ## 11 50-59 739 ## 12 11-15 747 ## 13 40-49 920
arrange() to order rowsbike %>% group_by(BikeAge_Gr) %>% summarise(crash_count = n()) %>% arrange(desc(crash_count))
## # A tibble: 13 × 2 ## BikeAge_Gr crash_count ## <chr> <int> ## 1 40-49 920 ## 2 11-15 747 ## 3 50-59 739 ## 4 20-24 680 ## 5 30-39 658 ## 6 16-19 605 ## 7 25-29 430 ## 8 6-10 421 ## 9 60-69 274 ## 10 <NA> 112 ## 11 0-5 60 ## 12 70+ 58 ## 13 70 12
sample_n() or sample_frac()sample_n(): randomly sample 5 observationsbike_n5 <- bike %>% sample_n(5, replace = FALSE) dim(bike_n5)
## [1] 5 54
sample_frac(): randomly sample 20% of observationsbike_perc20 <-bike %>% sample_frac(0.2, replace = FALSE) dim(bike_perc20)
## [1] 1143 54
dplyr resourcesVisit https://cran.r-project.org/web/packages/dplyr/vignettes/introduction.html for the package vignette.
Refer to the dplyr cheatsheet.
For when not working with dplyr or ggplot2
Refer to a variable in a dataset as bike$Crash_Loc
Access any element in a dataframe using square brackets
bike[1,5] # row 1, column 5
## # A tibble: 1 × 1 ## Bike_Age ## <int> ## 1 6
- For all observations in row 1: `bike[1, ]` - For all observations in column 5: `bike[, 5]`
Create a new dataframe that doesn't include observations where Bike_Injur = Injury since it's not clear what this means.
This new dataframe also should include observations in Durham, and where the biker is a teenager (13 to 19 years, inclusive).